Skip to content

[datafusion] Allow callers to skip default-database init in register_catalog#345

Merged
jerry-024 merged 6 commits into
apache:mainfrom
shyjsarah:fix/configurable-default-db
May 29, 2026
Merged

[datafusion] Allow callers to skip default-database init in register_catalog#345
jerry-024 merged 6 commits into
apache:mainfrom
shyjsarah:fix/configurable-default-db

Conversation

@shyjsarah
Copy link
Copy Markdown
Contributor

Purpose

SQLContext::register_catalog always probes/creates a "default" database and calls set_current_database("default") on the first catalog. This breaks for non-admin principals who lack DESCRIBE / CREATEDATABASE on default: they cannot construct a SQLContext at all, even when their queries target databases they do have access to (via fully-qualified catalog.db.table names). The error surfaces at session init as e.g. Forbidden: ... DESCRIBE on DATABASE default.

This change mirrors Java FlinkCatalog's defaultDatabase constructor parameter and DISABLE_CREATE_TABLE_IN_DEFAULT_DB option in Rust.

Brief change log

  • Add SQLContext::register_catalog_with_default_db(name, catalog, default_db: Option<&str>): Some(name) ensures name exists (creates if missing) and sets it as current on the first registered catalog; None skips both the get_database probe and the implicit set_current_database — callers are expected to use fully-qualified table names
    or call set_current_database later.
  • register_catalog(name, catalog) now delegates with Some("default") — fully backwards-compatible.
  • Python binding (pypaimon_rust): new default_database keyword on SQLContext.register_catalog. Convention chosen so omitting the kwarg keeps the historical behavior: omitted / NoneSome("default") (back-compat); ""None (skip init); "name"Some("name") (custom default).

Tests

Added ProbeTrackingCatalog which counts get_database / create_database calls and returns Error::Unsupported (not DatabaseNotExist) from get_database, so failing
to skip the probe surfaces as a hard error — mimicking the Forbidden condition. Three unit tests:

  • register_catalog_with_none_skips_default_db_probeNone results in 0 calls to get_database / create_database; registration succeeds; current catalog is set on the
    session.
  • register_catalog_with_some_default_propagates_probe_errorSome("default") runs the probe once and surfaces the underlying error.
  • register_catalog_default_wrapper_uses_default_db — bare register_catalog() still delegates with Some("default") (back-compat).

All existing tests (175 in paimon-datafusion) continue to pass. cargo check -p paimon-datafusion -p pypaimon_rust is clean.

API and Format

New public method SQLContext::register_catalog_with_default_db. Existing SQLContext::register_catalog signature and behavior unchanged. Python SQLContext.register_catalog gains an optional default_database kwarg; existing call sites are unaffected. No storage-format changes.

Documentation

Rustdoc on both methods explains the default_db semantics and the non-admin-principal motivation. Python kwarg's docstring documents the three-value convention. No external docs update needed.

shyjsarah added 3 commits May 27, 2026 18:02
`SQLContext::register_catalog` always probes/creates a "default" database
and `set_current_database("default")` on the first catalog. This breaks
for non-admin principals who lack DESCRIBE / CREATEDATABASE on `default`:
they can never construct a `SQLContext` even when their actual queries
target databases they do have access to.

Add `register_catalog_with_default_db(catalog_name, catalog, default_db)`
where `default_db: Option<&str>`:

- `Some(name)` — ensure `name` exists (create if missing), set as current
  on first registration. Same behavior as before, just configurable.
- `None` — skip the default-database probe/create AND the implicit
  `set_current_database`. Callers are expected to use fully-qualified
  names or call `set_current_database` later.

This mirrors Java `FlinkCatalog`'s `defaultDatabase` constructor
parameter and `DISABLE_CREATE_TABLE_IN_DEFAULT_DB=true` option.

Existing `register_catalog(name, catalog)` delegates with
`Some("default")` — fully backwards compatible.

Python binding exposes the same control via a new `default_database`
keyword. Convention chosen to keep `None`/omitted as the historical
behavior:

  - omitted / `None`      -> Some("default")  (back-compat)
  - `default_database=""` -> None             (skip init)
  - `default_database=X`  -> Some(X)          (custom default)
Squash the wall-of-text doc blocks down to the load-bearing lines.
No code change.
ProbeTrackingCatalog records get/create_database calls and returns
Unsupported (not DatabaseNotExist) from get_database — so failing to
skip the probe surfaces as a hard error rather than the silent
create-on-missing path the existing MockCatalog allows.

Three cases:
  - None  -> neither get nor create_database is called; registration succeeds
  - Some  -> Unsupported error from get_database propagates
  - bare register_catalog() -> delegates with Some("default") (probes)
Copy link
Copy Markdown
Contributor

@jerry-024 jerry-024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — clean design, backward-compatible, and solves a real pain point for non-admin principals.

A few optional suggestions (none blocking):

  1. Document table-function behavior when default_db = None (sql_context.rs:163)
    register_table_functions(&self.ctx, &catalog, default_db.unwrap_or("default")) still resolves bare table names against "default". Consider adding a note in the doc comment: "When default_db = None, table functions (vector_search, full_text_search) require fully-qualified table names."

  2. Guard against Some("") at the Rust API level
    The Python binding handles empty-string → None conversion, but other Rust callers could pass Some("") directly, which would call get_database("") / create_database(""). A one-liner validation would make the API more robust:

    if matches!(default_db, Some("")) {
        return Err(DataFusionError::Plan("default_db must not be empty".into()));
    }
  3. Optional test: cover default_db = None + bare table function call (e.g. vector_search('unqualified', ...)) to document the expected error path.

All three are nit-level — happy to see them addressed here or in a follow-up.

…ack doc, integration test

Three optional review comments, all addressed:

1. Doc note on built-in TVF behavior with default_db=None: bare table
   names inside vector_search / full_text_search still resolve against
   the literal "default" namespace (the fallback in register_table_functions).
   Callers using None must qualify table names in those TVF calls.

2. API-level guard against Some(""). Python binding already maps "" → None
   to express "skip init", but raw Rust callers could pass Some("") directly
   and silently probe get_database(""). Reject at the boundary with a Plan
   error telling them to use None instead.

3. New integration-shaped test
   `register_catalog_with_none_table_function_resolves_bare_name_to_literal_default`
   documents the expected error path: bare TVF name + default_db=None →
   surfaces an error naming `default.<bare_name>` (proving the fallback fired).

Plus a unit test for the empty-string guard
(`register_catalog_with_some_empty_string_is_rejected`).

5 tests total now, all green.
@shyjsarah
Copy link
Copy Markdown
Contributor Author

LGTM — clean design, backward-compatible, and solves a real pain point for non-admin principals.

A few optional suggestions (none blocking):

  1. Document table-function behavior when default_db = None (sql_context.rs:163)
    register_table_functions(&self.ctx, &catalog, default_db.unwrap_or("default")) still resolves bare table names against "default". Consider adding a note in the doc comment: "When default_db = None, table functions (vector_search, full_text_search) require fully-qualified table names."
  2. Guard against Some("") at the Rust API level
    The Python binding handles empty-string → None conversion, but other Rust callers could pass Some("") directly, which would call get_database("") / create_database(""). A one-liner validation would make the API more robust:
    if matches!(default_db, Some("")) {
        return Err(DataFusionError::Plan("default_db must not be empty".into()));
    }
  3. Optional test: cover default_db = None + bare table function call (e.g. vector_search('unqualified', ...)) to document the expected error path.

All three are nit-level — happy to see them addressed here or in a follow-up.

Thanks for the review! All three nits addressed in c3971e6:

#1 Doc note added on register_catalog_with_default_db re: the TVF fallback: bare names in vector_search / full_text_search still resolve against literal "default"
when default_db = None, so callers must qualify table names in those TVF calls.

#2 Added the Some("") guard at the top of register_catalog_with_default_db — returns DataFusionError::Plan("default_db must not be empty; pass None to skip default-database init"). Doesn't affect the Python binding (already maps "" → None there); this is purely a footgun guard for raw Rust callers. New unit test
register_catalog_with_some_empty_string_is_rejected covers it.

#3 Added register_catalog_with_none_table_function_resolves_bare_name_to_literal_default — registers with None, runs SELECT * FROM vector_search('bare', ...), asserts the error names both default and bare (proving the fallback namespace fired against the MockCatalog's TableNotExist).

Test count: 3 → 5, all green (cargo test -p paimon-datafusion --lib register_catalog). cargo check -p paimon-datafusion -p pypaimon_rust clean. cargo fmt applied.

@shyjsarah shyjsarah closed this May 28, 2026
@shyjsarah shyjsarah reopened this May 28, 2026
shyjsarah added 2 commits May 28, 2026 19:26
…VF fallback doc, integration test"

This reverts commit c3971e6.
…guard, TVF fallback doc, integration test""

This reverts commit 1727c47.
Copy link
Copy Markdown
Contributor

@jerry-024 jerry-024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@jerry-024 jerry-024 merged commit 73cf7ca into apache:main May 29, 2026
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants